Abstract: The increasing data rates expected for future
wireless systems, of the order of Gb/s, directly impact the
throughput requirements of the modulation and coding
subsystems of the physical layer. In an effort to design a
suitable channel coding solution for 5G wireless systems, in
this brief we present a massively-parallel 2.48Gb/s
Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) decoder
implementation operating at 200MHz on a single FPGA of the
NI USRP-2953R. The algorithmic innovations leading to an
architecture-aware design, central to this work, are presented
in [?]. The high-level description of the entire
massively-parallel decoder was translated to a Hardware
Description Language (HDL), namely VHDL, using the algorithmic
compiler [?] in the National Instruments LabVIEW™
Communication System Design Suite (CSDS™) in approximately
2 minutes. This implementation demonstrates not only the
scalability of our initial work in [1] but also the rapid
prototyping capability of the LabVIEW CSDS™ tools. To the
best of our knowledge, at the time of writing, this is the
fastest implementation of a standard compliant QC-LDPC
decoder on a USRP using an algorithmic compiler.

Index Terms: 5G, mm-wave, SDR, USRP, QC-LDPC, layered
decoding.

I. Introduction

Wireless data traffic is expected to increase 1000-fold [?]
by the year 2020, with more than 50 billion devices connected
to these wireless networks and peak data rates of up to ten
Gb/s [?]. To address these challenges, the next generation of
wireless cellular technology being envisioned and researched
today is collectively termed Beyond-4G (B-4G) and 5G.
However, the envisioned operation of 5G systems in the
mm-wave (30-300GHz) spectrum comes with challenges such as
reliance on line of sight (LOS) communication, short range of
communication, significantly increased shadowing loss and
rapid fading in time, necessitating techniques such as large
antenna arrays and rapidly adaptive beamsteering. From a
physical layer perspective, the processing budget (especially
time) available to the channel encoder and decoder will
further decrease (relative to current generation systems such
as 4G LTE).

With this in mind, our ongoing research focuses on
high-throughput and low-latency error control coding solutions
(primarily based on the Low-Density Parity-Check (LDPC) [?]
family of codes) specially suited to 5G mm-wave systems. At
the time of writing this paper, a detailed progress report
focusing on the algorithmic innovations for high throughput,
and a subsequent case study leading to a 608Mb/s (at 260MHz)
standard compliant Quasi-Cyclic (QC) LDPC decoder, are
presented in [1] and [?]. To adapt to the evolving
specifications for 5G technology, implementations for our
ongoing research must be reconfigurable and scalable, and must
exhibit state-of-the-art performance; hence we choose the FPGA
approach to developing hardware instead of the ASIC approach.

The Universal Software Radio Peripheral (USRP) is a widely
used Software Defined Radio (SDR) system: a flexible and
affordable transceiver with the potential to turn a standard
host (such as a PC) into a powerful wireless prototyping
system. The availability of state-of-the-art, highly
reconfigurable hardware platforms (such as the FPGA) on the
USRP has opened up a huge space for implementing theoretical
algorithms at high speeds, crucial for systems such as those
required by 5G wireless.

In this brief we present an application of the work in [1]:
a 2.48Gb/s FPGA-based QC-LDPC decoder implemented on the
NI USRP-2953R (which has the Xilinx Kintex-7 (410t) FPGA)
using the FPGA IP compiler in LabVIEW™ CSDS™. Massive
parallelization was accomplished by employing 6 decoder cores
in parallel without any modification at the HDL level. This
compiler translated the entire high-level description of the
parallelization (done in a graphical algorithmic dataflow
language) to VHDL and further generated an optimized hardware
implementation from the algorithmic description. The main
contributions of this work are: (1) a demonstration of the
scalability of our decoder architecture in [?], and (2) a
demonstration of the ability of the LabVIEW™ CSDS™ tools to
rapidly prototype a high-level algorithmic description onto
FPGA hardware.

The remainder of this paper is organized as follows. Section
II outlines the construction of QC-LDPC codes and the decoding
algorithm used for the implementation. A brief overview of
the techniques leading to the software-pipelined decoder core
in [?] is given in Section III. The process of implementing
the 2.48Gb/s decoder and the performance results are detailed
in Sections IV and V respectively, and finally we conclude
with Section VI.

II. Quasi-Cyclic LDPC Codes and Decoding

Mathematically, given $k$ message bits, an LDPC code is the
null-space of its $m \times n$ parity-check matrix (PCM) $H$,
where $m$ denotes the number of parity-check equations or
parity-bits and $n$ $(= k + m)$ denotes the number of variable
nodes or code bits [7]. In the Tanner graph representation
(due to Tanner [?]), $H$ is the incidence matrix of a bipartite
graph comprising two sets: the check node (CN) set of $m$
parity-check equations and the variable node (VN) set of $n$
variable or bit nodes; CN $i$ is connected to VN $j$ if
$H(i,j) = 1$, $1 \le i \le m$ and $1 \le j \le n$.
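As a minimal sketch of these definitions, the snippet below
checks the null-space condition and lists the Tanner graph
edges for a toy 3 x 6 parity-check matrix. The matrix, the
test words and the helper names syndrome and tanner_edges are
illustrative assumptions and are not taken from the paper.

```python
# Toy parity-check matrix with m = 3 checks and n = 6 code bits.
# It is an assumption chosen for illustration only.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]


def syndrome(H, c):
    """Return H * c^T over GF(2); an all-zero result means c is a codeword."""
    return [sum(h * b for h, b in zip(row, c)) % 2 for row in H]


def tanner_edges(H):
    """List the (CN i, VN j) pairs with H(i, j) = 1, using 1-based indices."""
    return [(i + 1, j + 1)
            for i, row in enumerate(H)
            for j, h in enumerate(row) if h]


codeword = [1, 0, 1, 1, 1, 0]   # satisfies all three parity checks
corrupted = [0, 0, 1, 1, 1, 0]  # same word with the first bit flipped

print(syndrome(H, codeword))    # all parity checks satisfied
print(syndrome(H, corrupted))   # checks involving bit 1 fail
print(tanner_edges(H))
```

Each nonzero entry of the syndrome identifies a check node
whose parity equation is violated, which is the information a
message passing decoder works to remove.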

QC-LDPC codes are represented by an $m_b \times n_b$ base
matrix $H_b$, which comprises cyclically right-shifted identity
and zero submatrices, both of size $z \times z$, where
$z \in \mathbb{Z}^{+}$. For $1 \le i_b \le m_b$ and
$1 \le j_b \le n_b$, the shift value is
$s = H_b(i_b, j_b) \in S = \{-1\} \cup \{0, \ldots, z-1\}$.
The PCM $H$ is obtained by expanding $H_b$ using the mapping

\[
s \mapsto
\begin{cases}
I_s, & s \in S \setminus \{-1\} \\
0,   & s \in \{-1\}
\end{cases}
\]

where $I_s$ is an identity matrix of size $z$ which is
cyclically right-shifted by $s = H_b(i_b, j_b)$ and $0$ is the
all-zero matrix of size $z \times z$. Owing to this structure
provided by QC-LDPC codes, the decoding of these codes becomes
much simpler in hardware (mainly due to the simplified
interconnect complexity) compared to unstructured LDPC codes.
We believe that structured LDPC codes are highly likely
candidates for 5G systems. Thus, to demonstrate the initial
phase of our FPGA decoder architecture, we provide a case
study based on the QC-LDPC code specified in the IEEE 802.11n
(2012) standard [?], the throughput of which well surpasses
the requirement of the standard.
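The expansion of a base matrix into the full parity-check
matrix can be sketched as follows. This is a minimal
illustration under stated assumptions: the 2 x 3 base matrix
and the submatrix size 3 are hypothetical (they are not the
IEEE 802.11n matrices), and the function names are ours.

```python
def shifted_identity(z, s):
    """z x z identity cyclically right-shifted by s.

    Row r has its single 1 in column (r + s) mod z.
    """
    return [[1 if c == (r + s) % z else 0 for c in range(z)]
            for r in range(z)]


def expand(Hb, z):
    """Expand a base matrix with entries in {-1, 0, ..., z-1} into H.

    An entry of -1 maps to the z x z all-zero block; any other
    entry s maps to the identity right-shifted by s.
    """
    H = []
    for base_row in Hb:
        blocks = [shifted_identity(z, s) if s >= 0
                  else [[0] * z for _ in range(z)]
                  for s in base_row]
        for r in range(z):
            # Concatenate row r of every block in this block row.
            H.append([bit for blk in blocks for bit in blk[r]])
    return H


# Hypothetical base matrix: 2 block rows, 3 block columns.
Hb = [[0, -1, 2],
      [1, 0, -1]]
H = expand(Hb, z=3)

for row in H:
    print(row)
```

Because every valid block is a permuted identity, each row of
the expanded matrix has exactly as many ones as there are
valid blocks in its block row, which is what keeps the
interconnect regular.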

LDPC codes can be decoded using message passing (MP) or
belief propagation (BP) [?] on the bipartite Tanner graph,
where the CNs and VNs communicate with each other,
successively passing revised estimates of the associated
log-likelihood ratio (LLR) in every decoding iteration. The
decoder in [?] employs the efficient decoding algorithm in
[?], with pipelined processing of layers based on the
row-layered decoding technique in [?]. A stepwise description
of the version of the algorithm we have employed is given in
[?].

III. Software Pipelined Decoder Architecture 

In [?] we have presented several strategies to achieve high
throughput for the decoder architecture. To show how software
pipelining was accomplished for a single core (amongst the 6
parallel cores in this implementation), and for the sake of
continuity and completeness, we provide an overview of the
layer decoder architecture from [1] below.

From the perspective of CN processing, two or more CNs can be
processed at the same time (i.e. they are independent of each
other) if they do not have one or more VNs (code bits) in
common. In terms of $H$, an arbitrary subset of rows can be
processed at the same time provided that no two rows have a 1
in the same column of $H$. This subset of rows is termed a
row-layer (hereafter referred to as a layer). In other words,
given a set $\mathcal{L} = \{L_1, L_2, \ldots, L_I\}$ of $I$
layers in $H$, for all $u \in \{1, 2, \ldots, I\}$ and all
$i, i' \in L_u$ with $i \ne i'$, $N(i) \cap N(i') = \emptyset$,
where $N(i)$ denotes the set of VNs connected to CN $i$.
Owing to the structure of QC-LDPC codes, $|L_u| = z$. From
the VN or column perspective, $|L_u| = z$,
$\forall u \in \{1, 2, \ldots, I\}$ implies that the columns
of $H$ are also divided into subsets of size $z$ (hereafter
referred to as block columns) given by the set
$\mathcal{B} = \{B_1, B_2, \ldots, B_J\}$, $J = n/z = n_b$.
Since VNs in a block column may participate in CN equations
across several layers, we further divide the block columns
into blocks, where a block is the intersection of a layer and
a block column.

The $0$ submatrices in $H$ are defined as invalid blocks, where
there are no edges between the corresponding CNs and VNs, and
the submatrices $I_s$ as valid blocks. In a conventional
approach to scheduling, message computation is done for all
the valid and invalid blocks. To avoid processing invalid
blocks, we propose an alternate representation of $H_b$ in the
form of two matrices: $\beta_I$, the block index matrix, and
$\beta_S$, the block shift matrix, which hold the index
locations (column number of each block in a row or layer) and
the shift values (defining the connections between the CNs and
VNs), respectively, corresponding to only the valid blocks in
$H_b$.
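A minimal sketch of this alternate representation is given
below. It assumes the base matrix is stored as a list of rows
with -1 marking invalid blocks, uses 0-based column indices,
and keeps one list per layer (so layers may hold different
numbers of valid blocks). The base matrix and the function
name block_matrices are hypothetical.

```python
def block_matrices(Hb):
    """Return (block_index, block_shift) for the valid blocks of Hb.

    block_index[u] holds the column numbers of the valid blocks in
    layer u; block_shift[u] holds the matching shift values.
    """
    block_index, block_shift = [], []
    for row in Hb:
        valid = [(jb, s) for jb, s in enumerate(row) if s != -1]
        block_index.append([jb for jb, _ in valid])
        block_shift.append([s for _, s in valid])
    return block_index, block_shift


# Hypothetical 3 x 4 base matrix; -1 marks an invalid (all-zero) block.
Hb = [[0, -1, 2, -1],
      [1, 0, -1, 3],
      [-1, 4, -1, 0]]

block_index, block_shift = block_matrices(Hb)
print(block_index)
print(block_shift)

total_blocks = len(Hb) * len(Hb[0])
valid_blocks = sum(len(layer) for layer in block_index)
print(f"{valid_blocks} of {total_blocks} blocks need processing")
```

A scheduler that iterates over these two structures touches
only the valid blocks, so the work per layer is proportional
to the number of valid blocks instead of the number of block
columns.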

Fact 1. In the decoder architecture, CN and VN processing
is performed by a single processing unit termed the Node
Processing Unit (NPU). The NPU is further split into two
units, namely the Local NPU (LNPU) and the Global NPU (GNPU),
to reduce the decoding complexity.

A naive way to schedule the layer-level processing is shown
in Fig. 1(a). The outer for-loop executes $I$ times,
processing node metrics over all the layers. In the first
inner for-loop, the GNPU output is computed over the $J$
blocks in each layer, as per the algorithm in [?], and is then
fed to the second for-loop where the LNPU produces the
respective metrics for the same set of blocks. We call this
the 1x or the Baseline architecture. It is evident that one of
the NPUs idles while the other processes. To avoid idling, we
use Fact 1 and process the GNPU and LNPU in a pipeline as
shown in Fig. 1(b). This version is called the 2x or the
Pipelined architecture. We would like to emphasize here that
pipelining was described in software at the algorithmic
description level and not at the HDL level. The algorithmic
compiler translated the high-level description to an HDL
description for the case study decoder implementation in a
little over two minutes.

Remark 1. Fig. 1(b) shows up to 6 layers ($L_1$ to $L_6$) in
the pipeline. From the bound on the number of layers that one
can pipeline in this manner, derived in [?], the maximum
number for the QC-LDPC code in this case study is 6.
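The difference between the two schedules can be illustrated
with an idealized slot model. This is only a sketch under
strong assumptions that are ours and not the paper's: each
GNPU or LNPU pass over a layer takes one equal-length slot,
and the GNPU pass for the next layer never has to wait for
the LNPU result of the current one. Data dependencies, stalls
and the actual NPU computations are not modelled.

```python
def baseline_schedule(num_layers):
    """Slots as (gnpu_layer, lnpu_layer); None means the unit idles."""
    slots = []
    for u in range(1, num_layers + 1):
        slots.append((u, None))  # GNPU pass over layer u, LNPU idle
        slots.append((None, u))  # LNPU pass over layer u, GNPU idle
    return slots


def pipelined_schedule(num_layers):
    """Two-stage pipeline: GNPU works on layer t while LNPU works on t - 1."""
    slots = []
    for t in range(1, num_layers + 2):
        gnpu = t if t <= num_layers else None
        lnpu = t - 1 if t >= 2 else None
        slots.append((gnpu, lnpu))
    return slots


layers = 6  # number of layers shown in the pipeline of Fig. 1(b)
base = baseline_schedule(layers)
pipe = pipelined_schedule(layers)

print("baseline :", base)
print("pipelined:", pipe)
print(f"slots: {len(base)} vs {len(pipe)}")
```

In this model the pipelined schedule approaches half the
baseline slot count as the number of layers grows. The
measured gain reported in Table I is smaller than this ideal
figure, which is expected since the model ignores everything
except the overlap of the two loops.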

IV. Multi-core Decoder 

Figure 1: Illustration of software pipelining of layers for
the case study. (a) Layer-level processing view for the
Baseline architecture showing the cascade of the GNPU and LNPU
for-loops. (b) Layer-level processing view for the Pipelined
architecture showing the two for-loops operating in parallel
for half of a decoding iteration. Note that GNPU (LNPU) array
refers to a collection of $z$ GNPUs (LNPUs) which operate in
parallel so as to exploit the structure provided by QC-LDPC
codes.

The decoder implementation based on the Pipelined (2x)
architecture that achieves a throughput of 420Mb/s (at 200MHz)
is hereafter referred to as a core. The core operates for
$m_b \times n_b = 12 \times 24$, $z = 27$, 54 and 81, resulting
in code lengths of $n = 24 \times z = 648$, 1296 and 1944 bits
respectively and a code rate $R = 1/2$. It is worthwhile to
note that, for the Pipelined version of the decoder,
pipelining was fully described in software. Moreover, the
algorithm was described in a high-level language, graphical
code in LabVIEW™ (i.e. not in a hardware description
language). The algorithmic compiler in LabVIEW™ CSDS™
translated the high-level description into a VHDL description.
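The code parameters quoted for the core follow directly from
the base matrix dimensions and the submatrix sizes. The short
check below recomputes them, assuming a full-rank parity-check
matrix so that the number of message bits is the code length
minus the number of parity checks (as in Section II). The
variable and function names are ours.

```python
M_B, N_B = 12, 24          # base matrix dimensions of the case study
SUBMATRIX_SIZES = (27, 54, 81)


def code_parameters(z):
    """Return (n, k, rate) for submatrix size z, assuming k = n - m."""
    n = N_B * z            # code length in bits
    m = M_B * z            # number of parity checks
    k = n - m              # number of message bits
    return n, k, k / n


for z in SUBMATRIX_SIZES:
    n, k, rate = code_parameters(z)
    print(f"z = {z}: n = {n}, k = {k}, R = {rate}")
```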

On account of the scalability and reconfigurability of the
decoder architecture in [?], it is possible to achieve high
throughput by employing multiple decoder cores in parallel.
Fig. 2 shows the top-level multi-core decoder virtual
instrument (VI), where 6 cores are deployed on a single
Xilinx Kintex-7 FPGA (410t). The high-level operation of the
decoder is described in the steps below (corresponding to the
highlighted sections in Fig. 2):

1) A serial stream of the encoded data is read as frames
from the host-to-target Direct Memory Access (DMA)
mechanism. Here, the host may be an arbitrary processing
platform such as a PC or a real-time controller and the
target is the Xilinx Kintex-7 FPGA (410t) on the NI
USRP-2953R. This data is subsequently stored in the Dynamic
Random Access Memory (DRAM).

2) Request frames from the DRAM. 

3) Read and buffer frames from the DRAM. 

4) Distribute incoming frames to the cores in a round-robin 
manner. 

5) Perform decoding with fixed-latency, parallel processing 
of frames staggered with respect to time. Buffer the 
decoded frames. 

6) Collect the decoded frames and serialize them with
respect to the round-robin manner used in step (4).

7) Write frames to the target-to-host DMA mechanism. 
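
The distribution and collection steps above can be sketched
in software as follows. This is a simplified model under our
own assumptions: the decoder core is replaced by a placeholder
function, the per-core buffers are plain queues, and the
fixed latency and time staggering of the hardware are not
modelled. The names decode_stub and multicore_decode are
hypothetical.

```python
from collections import deque


def decode_stub(frame):
    """Placeholder for one decoder core; it only tags the frame."""
    return f"decoded({frame})"


def multicore_decode(frames, num_cores=6):
    """Deal frames to cores round-robin and collect them in the same order."""
    # Step 4: distribute incoming frames to per-core input queues.
    inputs = [deque() for _ in range(num_cores)]
    for idx, frame in enumerate(frames):
        inputs[idx % num_cores].append(frame)

    # Step 5: every core decodes its own queue and buffers the results.
    outputs = [deque(decode_stub(f) for f in queue) for queue in inputs]

    # Step 6: read the output buffers in the same round-robin order,
    # which restores the original frame order.
    return [outputs[idx % num_cores].popleft() for idx in range(len(frames))]


frames = list(range(14))
print(multicore_decode(frames))
```

Reading the output buffers in the same order as the frames
were dealt is enough to restore the stream order here because
each core returns its frames first-in first-out; in hardware
the fixed decoding latency plays that role.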

V. Results 

The performance and resource utilization of the Baseline
and the Pipelined versions are compared in Table I. The
resources consumed by the Pipelined decoder are almost the
same as those of the Baseline decoder, in spite of the 1.5x
increase in throughput performance. The 2.48Gb/s decoder was
developed in stages, where at each stage a core was added
(except for stage 3) and the performance and resource figures
were recorded. The results of each stage are compared in
Table II. The Bit Error Rate (BER) performance of the
2.48Gb/s version (with 6 cores) is shown in Fig. 3.

Figure 3: Bit Error Rate (BER) performance comparison
between uncoded BPSK (green) and the 2.48Gb/s, rate=1/2,
QC-LDPC decoder (red) on the NI USRP-2953R containing
the Xilinx Kintex-7 (410t) FPGA. Legend: Uncoded BPSK; LDPC
Decoder (Rate = 1/2, (1944, 972) QC-LDPC code in IEEE 802.11n
2012 (WiFi standard)).

We have successfully demonstrated this work at IEEE
GLOBECOM'14, where an overall throughput of 2.06Gb/s
was achieved by using five decoder cores in parallel on the
Xilinx Kintex-7 (410t) FPGA in the NI USRP-2953R.
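
How close the multi-core scaling is to linear can be checked
from the throughput figures reported in Table II. The snippet
below is a small worked calculation on those figures; the
function name scaling and the definition of efficiency as
speedup divided by the number of cores are our own choices.

```python
# Throughput in Mb/s for each number of cores, copied from Table II.
throughput = {1: 420, 2: 830, 4: 1650, 5: 2060, 6: 2476}


def scaling(throughput):
    """Return {cores: (speedup, efficiency)} relative to one core."""
    base = throughput[1]
    return {cores: (t / base, t / (cores * base))
            for cores, t in throughput.items()}


results = scaling(throughput)
for cores, (speedup, efficiency) in results.items():
    print(f"{cores} cores: speedup {speedup:.2f}x, "
          f"efficiency {efficiency:.1%}")

# Gain of the Pipelined core over the Baseline core, from Table I.
pipelining_gain = 420 / 290
print(f"pipelining gain: {pipelining_gain:.2f}x")
```

With these figures the efficiency stays above 98% up to 6
cores, so throughput grows almost linearly with the number of
cores.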

VI. Conclusion 

This work validates the scalability of our decoder
architecture in [1] by deploying multiple decoder cores in
parallel. The development was done using an algorithmic
compiler that translated the high-level description of the
decoding algorithm into an HDL in approximately 2 minutes.
The standalone standard compliant decoder achieves an
overall throughput of 2.48Gb/s at an operating frequency of
200MHz on the Xilinx Kintex-7 FPGA in the NI USRP-2953R.
With little or no modification this decoder can be applied to
a large family of standard compliant QC-LDPC codes such as
those specified in IEEE 802.16e and Digital Video Broadcast

Figure 2: Top-level VI describing the parallelization of the
QC-LDPC decoder on the NI USRP-2953R containing the Xilinx
Kintex-7 (410t) FPGA.

                              Baseline   Pipelined
Throughput (Mb/s)             290        420
Clock Rate (MHz)              200        200
Time to generate VHDL (min)   2.02       2.08
Total Compile Time (min)                 36
Total Slice (%)               26         28
LUT (%)                       16         18
FF (%)                        9          10
DSP (%)                       5          5
BRAM (%)                      11         11

Table I: Performance and resource utilization comparison of
the Baseline architecture with the Pipelined architecture of
the QC-LDPC decoder on the NI USRP-2953R containing the
Xilinx Kintex-7 (410t) FPGA.

Cores                  1      2      4      5      6
Throughput (Mb/s)      420    830    1650   2060   2476
Clock Rate (MHz)       200    200    200    200    200
Time to VHDL (min)     2.08   2.08   2.08   2.02   2.04
Total Compile (min)    ≈36    ≈60    104    132    145
Total Slice (%)        28     44     77     85     97
LUT (%)                18     28     51     62     73
FF (%)                 10     16     28     33     39
DSP (%)                5      11     21     26     32
BRAM (%)               11     18     31     38     44

Table II: Performance and resource utilization comparison
for versions with varying number of cores of the QC-LDPC
decoder implemented on the NI USRP-2953R containing the
Xilinx Kintex-7 (410t) FPGA.